1  Introduction

Our world contains an overwhelming variety of objects.
While people demonstrate outstanding abilities to memorize
and recognize thousands of objects [27, 37, 38],
computer vision applications largely fail to accommodate
these numbers. Apparently, the main tool that enables
people to effectively handle this massive number of objects
is categorization. By dividing the observed objects
into classes, the visual system is capable of inferring
properties of unfamiliar objects from their resemblance
to familiar ones. For familiar objects, categorization offers
an indexing tool into the stored library of object
representations.

Recognition  can  be  performed  in  different  “levels  of 
abstraction”.  For  example,  the  same  object  can  be  rec¬ 
ognized  as  a  face,  a  human  face,  or  as  a  specific  person’s 
face.  Psychological  studies  suggest  the  existence  of  a  pre¬ 
ferred  level  for  recognition,  called  “the  basic  level  of  ab¬ 
straction”  [33].  Existing  computational  schemes  usually 
approach  recognition  in  either  one  of  two  levels.  Several 
schemes  attempt  to  classify  objects  in  their  basic  level 
of  abstraction  (we  refer  to  this  task  by  categorization), 
while  other  schemes  attempt  to  determine  the  specific 
identity  of  objects  (we  refer  to  this  task  by  identifica¬ 
tion).  This  paper  presents  a  novel  approach  for  recogni¬ 
tion  that  combines  the  two  tasks. 

To  see  how  the  two  tasks  are  related,  consider  the  fol¬ 
lowing  example.  Suppose  you  are  walking  down  a  street, 
and  someone  is  coming  towards  you.  You  look  at  the 
person’s  face,  and  it  looks  familiar,  but  you  cannot  tell 
who  it  is.  So  you  try  to  picture  the  people  you  know  who 
look  like  the  person  you  see,  until  finally,  you  realize  who 
the  person  is. 

A  number  of  hypotheses  can  be  drawn  from  this  story. 
First,  recognition  can  be  broken  into  two  stages:  cat¬ 
egorization  and  identification,  where  categorization  is 
believed  to  precede  identification.  Second,  during  the 
course  of  recognition  the  image  is  compared  against  a 
number  of  object  models.  Assuming  that  indeed  catego¬ 
rization  precedes  identification,  only  models  that  belong 
to  the  object’s  class  need  to  be  considered.  Finally,  when 
a  new  model  is  compared  to  the  image,  the  comparison 
process  may  benefit  from  the  use  of  information  acquired 
during  categorization.  Note  that  the  situation  described 
here is not specific to faces. One can imagine that similar
situations occur when other objects, such as animals,
cars, and chairs, are observed.

To see how information acquired during categorization
can be used for identification, consider the example
of face recognition. When a face is recognized, the image
positions of its parts and features are known. In particular,
an observer already knows where the eyes, nose,
and mouth are and can even infer the direction of gaze
and expression. The person’s identity is not essential for
extracting and locating these features. Instead, they are
matched against features in a “generic” representation.
In addition, other features, such as a beard, hair style,
and wrinkles, that may better distinguish between different
persons may be located. More generally, we can
postulate that, during categorization, sub-structures of
the objects (such as parts and features) are extracted
and located with respect to a generic model, and the
object’s pose is determined.

To follow this example, I propose a scheme for recognizing
3D objects from single 2D views that combines
the two stages, categorization and identification. Categorization
is achieved by aligning the image to prototype
objects. The prototype that appears most similar
to the image determines the class identity of the object.
After the object is categorized, its specific identity is determined
by aligning the observed object to individual
models of its class. By first categorizing the object, not
only is the number of models considered for identification
reduced, but the cost of comparing each model
to the image also decreases significantly. This is achieved by
reusing the correspondence and pose computed for the
prototype in the categorization stage to align the image
with the individual models. We show in this paper that,
although a perfect match between the prototype and the
image is not obtainable, the correspondence and pose
can be computed for the prototype and can be used
to bring the image and the object’s model into alignment.
Consequently, recovering the correspondence and
pose for the individual models becomes unnecessary, and
identification is reduced to a series of simple template
comparisons.

The  rest  of  this  paper  is  divided  as  follows.  Section  2 
reviews  the  main  existing  approaches  for  categorization 
and  identification.  Section  3  presents  the  scheme  of 
recognition  by  prototypes.  Section  4  proposes  an  algo¬ 
rithm  for  generating  optimal  prototypes  for  the  scheme. 
Section  5  discusses  the  relevance  of  the  scheme  to  hu¬ 
man  recognition.  Implementation  results  are  presented 
in  Section  6. 

2  Previous  Approaches 

Existing  schemes  for  categorization  often  use  a  “reduc¬ 
tionist”  approach.  The  image,  which  contains  a  detailed 
appearance  of  an  object,  is  transformed  into  a  compact 
representation  that  is  invariant  for  all  objects  of  the 
same  class.  One  common  approach  to  generating  such  a 
representation  is  by  decomposing  the  object  into  parts. 
Parts are extracted by cutting the object at concavities
[17, 22, 43] and labeled according to their general shape.
The  labels,  together  with  the  spatial  relationships  be¬ 
tween  the  parts,  are  used  to  identify  the  class  of  the  ob¬ 
ject  [4,  6,  7,  26].  A  second  approach  extracts  the  parts  of 
the  object  that  fulfill  certain  functions.  The  list  of  func¬ 
tions  is  used  to  determine  the  object’s  class  [16,  39,  47]. 

Schemes  that  break  objects  into  parts  are  insufficient 
to  explain  all  the  aspects  of  recognition  for  the  following 
reasons. First, in many cases objects that belong to the
same  class  differ  only  by  their  detailed  shape,  while  they 
share  roughly  the  same  set  of  parts.  Moreover,  even  ob¬ 
jects  that  at  some  level  may  be  considered  belonging  to 
different  classes,  such  as  a  cat  and  a  dog,  may  also  share 
roughly  the  same  set  of  parts.  To  solve  this  problem  sev¬ 
eral  systems  also  store,  in  addition  to  the  part  structure 
of  the  objects,  the  detailed  shape  of  the  parts  [2,  6,  7]. 
Another  problem  is  that  many  of  the  techniques  for  rec¬ 
ognizing  objects  by  part  decomposition  rely  on  finding 
the  entire  parts  from  the  image. 

To recognize the specific identity of objects, a relatively
detailed representation of the object’s shape is
compared with the image. An example of such methods
is alignment [3, 9, 12, 13, 18, 23, 25, 40, 41]. Alignment
involves  recovering  the  position  and  orientation  (pose)  in 
which  the  object  is  observed  and  comparing  the  appear¬ 
ance  of  the  object  from  that  pose  with  the  image.  Only 
a  few  attempts  have  been  made  in  the  past  to  extend 
the  alignment  scheme  to  the  problem  of  object  catego¬ 
rization  (e.g.,  [36]).  The  main  difficulty  in  applying  the 
alignment  approach  is  the  recovery  of  the  pose  of  the 
observed  object.  In  most  implementations  this  involves 
a  time-consuming  stage  for  finding  the  correspondence 
between  the  model  and  the  image.  The  process  becomes 
impractical  when  the  image  is  compared  against  a  large 
library  of  objects,  because  typically  the  correspondence 
is  established  between  the  image  and  each  of  the  models 
in  the  library  separately. 

To  handle  large  libraries,  indexing  methods  were  pro¬ 
posed  (e.g.,  [20,  46,  14]).  The  basic  idea  is  the  following. 
A  certain  function  is  defined  and  applied  to  the  views  of 
all  the  objects  in  the  library.  The  object  models  are  ar¬ 
ranged  in  a  look-up  table  indexed  by  the  obtained  func¬ 
tion  values.  When  an  image  is  given,  the  function  is 
applied  to  the  image,  and  the  obtained  value  is  used  to 
index  into  the  table.  To  reduce  the  size  of  the  table  and 
the  complexity  of  its  preparation,  invariant  functions, 
functions that, when applied to different views of an object,
return the same value regardless of viewpoint, often
are  used  as  the  indexing  functions. 
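
The look-up idea can be illustrated with a minimal Python sketch. The invariant used here, a quantized ratio of two inter-point distances, is an assumption chosen for illustration only (it is invariant to 2D rotation, translation, and uniform scaling, not to general viewpoint changes) and is not one of the indexing functions of the cited methods. The names `ratio_invariant`, `build_index`, and `lookup` are hypothetical.

```python
from collections import defaultdict

import numpy as np


def ratio_invariant(points, bins=10):
    """Toy index value: quantized ratio of two inter-point distances."""
    p = np.asarray(points, dtype=float)
    d01 = np.linalg.norm(p[0] - p[1])
    d02 = np.linalg.norm(p[0] - p[2])
    return int(round(bins * d01 / (d01 + d02)))


def build_index(library, index_fn=ratio_invariant):
    """Arrange model names in a table keyed by their index value."""
    table = defaultdict(list)
    for name, points in library.items():
        table[index_fn(points)].append(name)
    return table


def lookup(table, image_points, index_fn=ratio_invariant):
    """Apply the same function to the image and index into the table."""
    return table.get(index_fn(image_points), [])


# Hypothetical two-model library of 2D point sets.
library = {
    "wedge": [(0, 0), (1, 0), (0, 4)],
    "square_corner": [(0, 0), (2, 0), (0, 2)],
}
table = build_index(library)

# A rotated, scaled, and translated view of the first model.
theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
view = 2.0 * np.asarray(library["wedge"], dtype=float) @ rot.T + np.array([3.0, -1.0])
candidates = lookup(table, view)
```

Because the index uses only three points, unrelated models can easily fall into the same bin, which is the false-positive problem discussed next.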

Indexing  methods  suffer  from  several  shortcomings. 
First,  existing  indexing  methods  handle  only  rigid  ob¬ 
jects.  Extending  these  methods  to  handle  classes  of  ob¬ 
jects  has  not  been  discussed.  Second,  because  of  com¬ 
plexity  issues,  indexing  functions  usually  are  applied  to 
small  numbers  of  features.  As  a  result,  high  rates  of 
false  positives  are  obtained,  and  the  effectiveness  of  the 
indexing  is  reduced. 

The  scheme  presented  in  this  paper  is  designed  to 
work  where  traditional  approaches  to  categorization  and 
indexing  fail.  The  scheme  combines  both  categorization 
and  identification  of  objects,  and  uses  fairly  detailed  rep¬ 
resentations  for  objects.  Rather  than  indexing  directly 
to  the  specific  object  model,  the  scheme  indexes  into 
the  library  of  objects  by  categorizing  the  object.  The 
classes  handled  by  the  scheme  include  objects  with  rel¬ 
atively  similar  shapes.  To  fit  into  the  scheme,  in  some 
cases  basic  level  classes  are  broken  into  sub-classes.  The 
general  problem  of  categorization  therefore  may  require 
additional  tools. 

3  Recognition  by  Prototypes 

The recognition by prototypes scheme proceeds as follows.
A library of 3D object models is stored in memory.
The models in the library are divided into classes,
and  3D  prototype  objects  are  selected  to  represent  the 
classes.  For  every  class,  the  models  in  the  class  are 
aligned  in  the  library  with  the  prototype  object.  The 
role  of  this  3D  alignment  will  become  clear  shortly. 

At recognition time, an incoming 2D image is first
matched against all of the prototypes. For each prototype
object, the system attempts to recover the view of
the prototype that most resembles the image. To do so,
the system recovers the correspondence between the prototype
and the image, and, using this correspondence, it
determines the transformation that best aligns the prototype
with the image. This transformation, referred
to as the prototype transform, is then applied to the
prototype, and the similarity between the transformed
prototype and the actual image is evaluated. Since the
observed object in general differs from the prototype object,
a perfect match between the two is not anticipated.
The system therefore seeks a prototype that reasonably
matches the image. Once such a prototype is found, the
class identity of the object is determined.

After  the  object’s  class  is  determined,  the  system  at¬ 
tempts  to  recover  the  specific  identity  of  the  object.  At 
this  stage,  the  image  is  matched  against  all  the  models 
of  the  object’s  class.  For  each  of  these  models,  the  sys¬ 
tem  seeks  to  recover  the  transformation  that  aligns  the 
model  with  the  image.  As  will  be  shown  below,  since  the 
models  are  aligned  in  the  library  with  the  prototype,  the 
transformation  that  best  aligns  the  prototype  with  the 
image  is  identical  to  the  transformation  that  aligns  the 
model to the image. The prototype transform therefore
is  applied  to  the  specific  models,  and  their  appearance 
from  this  pose  is  compared  with  the  image.  The  model 
that  aligns  with  the  image,  if  there  is  such,  determines 
the  specific  identity  of  the  object. 

The  rest  of  this  section  is  divided  as  follows.  In  Sec¬ 
tion  3.1  the  object  representation  used  in  our  scheme  is 
presented.  Section  3.2  describes  the  categorization  stage, 
and  Section  3.3  describes  the  identification  stage. 

3.1  Object  representation  -  the  linear 
combination  scheme 

In our scheme, an object is modeled by a matrix M
of size n × k, where n is the number of feature points,
and k represents the degrees of freedom of the object.
A vector $\vec{a} \in \mathcal{R}^k$, referred to as the transform vector,
represents the transformation applied to the object in a
certain view, and the object’s appearance from this view
is given by

$$\vec{v} = M \vec{a} \qquad (1)$$

In  the  rest  of  this  section  we  explain  the  use  of  this  nota¬ 
tion.  The  notation  follows  from  the  linear  combination 
scheme  [42],  which  is  briefly  reviewed  below. 

Under  the  linear  combination  scheme  an  object  is 
modeled by a small set of views, each represented
by  a  vector  containing  point  positions,  where  the  points 
in  these  views  are  ordered  in  correspondence.  Novel 
views  of  the  object  are  obtained  by  applying  linear  com¬ 
binations  to  the  stored  views.  Additional  constraints 
may  apply  to  the  coefficients  of  this  linear  combination. 
Computing  the  object  pose  therefore  requires  recovering 
the  coefficients  of  the  linear  combination  that  align  the 
model  with  the  image  and  verifying  that  the  recovered 
coefficients  indeed  satisfy  the  constraints.  The  method 
handles  rigid  objects  under  weak-perspective  projection 
(namely,  orthographic  projection  followed  by  a  uniform 
scaling). It was extended to approximate the appearance
of objects with smooth bounding surfaces and to handle
articulated objects. In our representation, the columns
of  the  model  matrix  M  contain  views  of  the  object,  and 
the  coefficients  of  the  linear  combination  that  align  the 
model  with  the  image  are  given  by  the  transform  vector 
a. 

For concreteness, we review the linear combination
scheme for rigid objects. Consider a 3D object O that
contains n feature points $(X_i, Y_i, Z_i)$, $1 \le i \le n$. Under
weak-perspective projection, the position of the object
following a rotation R, translation t, and scaling s is
given by

$$\begin{aligned}
x_i &= s r_{11} X_i + s r_{12} Y_i + s r_{13} Z_i + t_x \\
y_i &= s r_{21} X_i + s r_{22} Y_i + s r_{23} Z_i + t_y
\end{aligned} \qquad (2)$$

where $r_{ij}$ are the components of the rotation matrix R,
and $t_x$ and $t_y$ are the horizontal and vertical components of
the translation vector t, respectively.

Denote by $X, Y, Z, x, y \in \mathcal{R}^n$ the vectors of $X_i$, $Y_i$, $Z_i$, $x_i$,
and $y_i$ values respectively, and denote $\mathbf{1} = (1, \ldots, 1) \in \mathcal{R}^n$.
We can rewrite Eq. 2 as a vector equation as follows:

$$\begin{aligned}
x &= a_1 X + a_2 Y + a_3 Z + a_4 \mathbf{1} \\
y &= b_1 X + b_2 Y + b_3 Z + b_4 \mathbf{1}
\end{aligned} \qquad (3)$$

where

$$\begin{aligned}
a_1 &= s r_{11} & b_1 &= s r_{21} \\
a_2 &= s r_{12} & b_2 &= s r_{22} \\
a_3 &= s r_{13} & b_3 &= s r_{23} \\
a_4 &= t_x & b_4 &= t_y
\end{aligned}$$

Therefore

$$x, y \in \mathrm{span}\{X, Y, Z, \mathbf{1}\} \qquad (4)$$

Different views of the object are obtained by changing
the rotation, scale, and translation parameters, and
these changes result in changing the coefficients in Eq. 3.
We may therefore conclude that all the views of a rigid
object are contained in a 4D linear space.

This property, that the views of a rigid object are
contained in a 4D linear space, provides a method for
constructing viewer-centered representations for the object.
The idea is to use images of the object to construct
a basis for this space. In general, two views provide sufficiently
many vectors. Therefore, any novel view is a
linear combination of two views [30, 42].

Not every linear combination is a valid view of a rigid
object. Following the orthonormality of the row vectors
of the rotation matrix, the coefficients in Eq. 3 must
satisfy the two quadratic constraints

$$\begin{aligned}
a_1^2 + a_2^2 + a_3^2 &= b_1^2 + b_2^2 + b_3^2 \\
a_1 b_1 + a_2 b_2 + a_3 b_3 &= 0
\end{aligned} \qquad (5)$$


When  the  constraints  are  not  satisfied,  distorted  (by 
stretch  or  shear)  pictures  of  the  objects  are  generated. 
In  case  a  viewer-centered  representation  is  used,  the  con¬ 
straints  change  in  accordance  with  the  selected  basis.  A 
third  view  of  the  object  can  be  used  to  recover  the  new 
constraints. 
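
The two quadratic constraints can be checked numerically. The following is a minimal sketch, assuming an object-centered representation; the helper names `rotation_zx`, `view_coefficients`, and `is_rigid_view` and the tolerance are hypothetical choices made for illustration.

```python
import numpy as np


def rotation_zx(alpha, beta):
    """Rotation about the z-axis by alpha, followed by the x-axis by beta."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])
    return rx @ rz


def view_coefficients(R, s, t):
    """Coefficients a and b of the vector form of weak-perspective projection."""
    a = np.append(s * R[0], t[0])  # (s r11, s r12, s r13, t_x)
    b = np.append(s * R[1], t[1])  # (s r21, s r22, s r23, t_y)
    return a, b


def is_rigid_view(a, b, tol=1e-9):
    """Check the two quadratic constraints on the first three coefficients."""
    equal_norms = abs(a[:3] @ a[:3] - b[:3] @ b[:3]) < tol
    orthogonal = abs(a[:3] @ b[:3]) < tol
    return bool(equal_norms and orthogonal)


a, b = view_coefficients(rotation_zx(0.4, -0.7), s=1.5, t=(2.0, -3.0))
rigid_ok = is_rigid_view(a, b)

# Perturbing one coefficient yields a sheared, non-rigid picture of the object.
sheared_ok = is_rigid_view(a + np.array([0.0, 0.3, 0.0, 0.0]), b)
```

The translation components are excluded from the check because the constraints involve only the scaled rotation terms.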

For the purpose of this paper a model for a rigid object
can be constructed by building the following n × 4 model
matrix

$$M = (X, Y, Z, \mathbf{1})$$

Views of the object can be constructed as follows

$$\begin{aligned}
x &= M a \\
y &= M b
\end{aligned} \qquad (6)$$

where $a = (a_1, a_2, a_3, a_4)$ and $b = (b_1, b_2, b_3, b_4)$ are the
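coefficients from Eq. 3. Notice that the two linear systems
can be merged into one by constructing a modified
model matrix in the following way

$$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} M & 0 \\ 0 & M \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \qquad (7)$$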

Similar  constructions  can  be  obtained  for  objects  with 
smooth  bounding  surfaces  and  for  articulated  objects. 
The width of M, k, should then be modified according
to the degrees of freedom of the modeled object. As was
mentioned above, viewer-centered representations can be
obtained by constructing a basis for the 4D space from
images of the object. Therefore, viewer-centered models
can be obtained by replacing the column vectors of M
with  the  constructed  basis. 
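
As a small illustration of the rigid-object model matrix, the sketch below builds M from 3D feature points and generates a view from two coefficient vectors. The block-diagonal merge shown is one natural way to combine the two linear systems into a single one and is an assumption of this sketch; the function names and the sample points are hypothetical.

```python
import numpy as np


def model_matrix(points3d):
    """Build the n x 4 model matrix whose columns are X, Y, Z, and ones."""
    pts = np.asarray(points3d, dtype=float)
    return np.column_stack([pts, np.ones(len(pts))])


def merged_model_matrix(M):
    """Assumed merge: a 2n x 2k block-diagonal matrix acting on (a, b)."""
    zeros = np.zeros_like(M)
    return np.block([[M, zeros], [zeros, M]])


# Hypothetical object with five feature points.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
M = model_matrix(points)

a = np.array([1.0, 0.0, 0.0, 2.0])   # x = X + 2
b = np.array([0.0, 1.0, 0.0, -1.0])  # y = Y - 1
x, y = M @ a, M @ b

# The merged system reproduces both coordinate vectors at once.
stacked = merged_model_matrix(M) @ np.concatenate([a, b])
```

With the merged form, a single view vector holds all x coordinates followed by all y coordinates, so one transform vector of length 2k describes the whole view.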

To  summarize,  following  the  linear  combination 
scheme  we  can  represent  an  object  by  a  matrix  M  and 
construct  views  of  the  object  by  applying  it  to  trans¬ 
form  vectors  a.  For  rigid  objects  not  every  transform 
vector  is  valid;  the  components  of  the  transform  vector 
must  satisfy  the  two  quadratic  constraints.  Recognition 
involves  recovering  the  transform  vector  a  and  verifying 
that  its  components  satisfy  the  two  constraints.  Ignor¬ 
ing  these  constraints  will  result  in  recognizing  the  object 
even  when  it  undergoes  general  3D  affine  transformation. 
In  the  analysis  below  we  largely  ignore  the  quadratic 
constraints.  These  constraints,  however,  can  be  verified 
both  during  the  categorization  stage  as  well  as  during 
the  identification  stage. 

3.2  Categorization 

The recognition by prototypes scheme begins by determining
the object’s category. This is achieved by comparing
the observed object to prototype objects, objects
that  are  “typical  exemplars”  for  their  classes.  For  a  given 
prototype,  the  view  of  the  prototype  that  most  resem¬ 
bles  the  image  is  recovered  and  compared  to  the  actual 
image,  and  the  result  of  this  comparison  determines  the 
class  identity  of  the  object. 

We begin our description of the categorization stage
by defining the data structures used by the scheme. A
class $C = (P, \{M_1, M_2, \ldots, M_l\})$ is a pair that includes a
prototype P and a set of object models $M_1, M_2, \ldots, M_l$.
Both the prototype and the models are represented by
n × k matrices, where n defines the number of feature
points considered, and k denotes the degrees of freedom
of the objects. For the sake of simplicity we assume here
that all the objects share the same number of feature
points, n, and that they have similar degrees of freedom,
k. Note that similar objects tend to have similar degrees
of freedom (e.g., all of them are rigid). Neither assumption
is strict, however. The scheme can be modified to
tolerate both a varying number of feature points and
different degrees of freedom. The details will be discussed
later in this paper. Note that the objects can
be modeled by either object-centered or viewer-centered
representations. In case viewer-centered representations
are used we shall assume that the models represent the
objects from the same range of viewpoints.

A class in our scheme contains objects with similar
shapes. These objects share roughly the same topologies,
and there exists a “natural” correspondence between
them. Consider, for instance, the two chairs in
Figure 1. Although the shapes of these chairs are different,
and some parts (e.g., the arms) appear only in
one chair and not in the other, a natural correspondence
between features in the two objects can be determined.

In  the  library  of  models,  the  natural  correspondence 
between  objects  is  made  explicit.  It  is  specified  by  the 
order  of  the  row  vectors  of  the  models.  Specifically,  given 
a prototype P and object models $M_1, \ldots, M_l$, we order
the rows of these models such that the first feature point
of P corresponds to the first feature point of each of the
models $M_1, \ldots, M_l$, and so forth.
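
The class structure and the row-ordering convention can be sketched as a small container. This is only an illustration; the name `ObjectClass`, its methods, and the placeholder matrices are hypothetical and assume that all matrices in a class share the same n × k shape.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class ObjectClass:
    """A class C = (P, {M_1, ..., M_l}) with rows ordered in correspondence."""
    prototype: np.ndarray
    models: List[np.ndarray] = field(default_factory=list)

    def add_model(self, model):
        model = np.asarray(model, dtype=float)
        # Row i of every model must describe the same feature as row i of P,
        # so at minimum the shapes have to agree.
        if model.shape != self.prototype.shape:
            raise ValueError("model and prototype must share the same n x k shape")
        self.models.append(model)

    def corresponding_rows(self, i):
        """Return feature i of the prototype followed by feature i of each model."""
        return [self.prototype[i]] + [m[i] for m in self.models]


# Hypothetical class with a 6 x 4 prototype and two placeholder models.
chairs = ObjectClass(prototype=np.zeros((6, 4)))
chairs.add_model(np.ones((6, 4)))
chairs.add_model(np.full((6, 4), 2.0))
first_feature = chairs.corresponding_rows(0)
```

Storing the correspondence implicitly in the row order means no separate matching table is needed between the prototype and the models of its class.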

Given  the  library  of  objects  and  given  an  incoming  im¬ 
age,  the  recognition  by  prototypes  scheme  begins  by  cat¬ 
egorizing  the  object  observed  in  the  image.  To  achieve 
this  goal,  the  prototype  objects  are  aligned  and  com¬ 
pared  to  the  image.  For  every  prototype,  the  correspon¬ 
dence  between  the  image  and  the  prototype  is  first  re¬ 
solved,  and,  using  this  correspondence,  the  nearest  pro¬ 
totype  view  is  recovered.  By  doing  so,  the  scheme  de¬ 
couples  the  two  factors  that  affect  the  appearance  of  the 
object  in  the  image,  namely,  view  variations  and  shape 
variations.  By  selecting  the  nearest  prototype  view  to 
the  image,  the  scheme  compensates  for  view  variations. 
Then,  by  evaluating  the  similarity  between  the  nearest 
prototype  view  and  the  actual  image,  it  accounts  for  the 
differences  in  shape  between  the  prototype  and  the  ob¬ 
served  object. 

The  first  stage  in  matching  the  prototype  to  the  image 
involves  the  recovery  of  correspondence  between  proto¬ 
type  and  image  features.  In  existing  systems  for  rec¬ 
ognizing  the  specific  identity  of  objects  establishing  the 
correspondence  between  images  and  object  models  in¬ 
volves  a  time-consuming  process  in  which  sophisticated 
algorithms  are  applied  [10,  13,  15,  18,  23,  25,  35,  41]. 
These  algorithms  rely  on  the  property  that,  when  the 
correct  correspondence  between  a  model  and  an  image  is 
established,  a  near-perfect  match  between  the  two  is  ob¬ 
tained.  While  this  assumption  is  valid  for  identification, 
it  cannot  be  used  under  our  scheme  since  the  prototype 
and  the  image  generally  represent  different  objects. 

To  determine  the  correspondence  between  the  proto¬ 
type  and  the  image,  we  define  an  objective  function  that 
is  applied  to  the  prototype  and  the  image  under  a  given 
correspondence  and  that  obtains  its  minimum  under  the 
correct  correspondence.  The  objective  function  will  mea¬ 
sure  the  quality  of  the  match  between  the  prototype  and 
the  image.  Namely,  under  this  measure  the  correct  cor¬ 
respondence  is  the  one  that  brings  the  prototype  into 
its  best  alignment  with  the  image.  Given  this  objective 
function,  correspondence  is  a  combinatorial  optimization 
problem,  and  so  minimization  techniques  can  be  used  to 
resolve  the  correspondence  between  the  prototype  and 
the  image.  This  paper  does  not  propose  a  specific  tech¬ 
nique  to  solve  the  correspondence  problem. 


Assuming the correspondence problem can be solved,
the scheme proceeds as follows. Given a prototype P
and an image I, we generate a view vector v from the
image by extracting the location of feature points and
arranging them in a vector. The points in v are ordered
in correspondence to the prototype points; that is, the
first point in v corresponds to the first point in P and
so forth. The prototype transform is the transformation
that brings the prototype points as close as possible to
their corresponding image points. The prototype transform,
therefore, is the transform vector b that minimizes
the least-squares distance between the prototype and
image points, namely

$$\min_{b'} \| P b' - v \| \qquad (8)$$


A solution for (8) is obtained as follows. Assume P
is overdetermined; that is, P is n × k where n > k and
rank(P) = k, and denote by $P^+ = (P^T P)^{-1} P^T$ the
pseudo-inverse of P. The prototype transform, b, is given
by

$$b = P^+ v \qquad (9)$$

and the nearest prototype view p is obtained by applying
P to the prototype transform, b, that is

$$p = P b = P P^+ v \qquad (10)$$
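
The pseudo-inverse solution can be sketched in a few lines of NumPy. This is a minimal illustration, assuming P has full column rank, in which case `numpy.linalg.pinv` agrees with $(P^T P)^{-1} P^T$; the function names and the randomly generated prototype are hypothetical.

```python
import numpy as np


def prototype_transform(P, v):
    """Least-squares transform vector b = P^+ v."""
    return np.linalg.pinv(P) @ v


def nearest_prototype_view(P, v):
    """Nearest prototype view p = P P^+ v."""
    return P @ prototype_transform(P, v)


# Hypothetical prototype with n = 8 feature points and k = 4.
rng = np.random.default_rng(0)
P = np.column_stack([rng.normal(size=(8, 3)), np.ones(8)])

b_true = np.array([0.5, -1.0, 2.0, 3.0])
v_exact = P @ b_true                           # a view of the prototype itself
v_noisy = v_exact + 0.01 * rng.normal(size=8)  # a slightly different object

b_exact = prototype_transform(P, v_exact)
p_noisy = nearest_prototype_view(P, v_noisy)
```

For a view of the prototype itself the transform is recovered exactly; for any other view the residual between the nearest prototype view and the image is orthogonal to the columns of P.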

The nearest prototype view is now compared to the
image, and their resemblance determines the class identity
of the object. The quality of the match between the
prototype and the image is defined by

$$D(P, v) = \| p - v \| = \| (P P^+ - I) v \| \qquad (11)$$


To eliminate effects due to scaling of the object, this
measure should be normalized, as is illustrated by the
example below. Consider an object seen from some view
$v_1$. Its distance to the prototype is given by $D(P, v_1)$.
Suppose the object is now seen from a new view $v_2$ that
is identical to $v_1$, except that the object is now twice
as close to the camera. Under these conditions $v_2 = 2 v_1$,
and its distance to the prototype is given by
$D(P, v_2) = 2 D(P, v_1)$. Clearly, we should have a measure that is
independent of the distance of the object to the camera.
One way to obtain such a measure is by dividing D(P, v)
by the norm $\|v\|$

$$D(P, v) = \frac{\| (P P^+ - I) v \|}{\| v \|} \qquad (12)$$
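
The effect of the normalization can be verified with a short sketch. The function name `match_distance`, its `normalize` flag, and the random test data are assumptions made for illustration.

```python
import numpy as np


def match_distance(P, v, normalize=True):
    """Distance between a view v and its nearest prototype view."""
    v = np.asarray(v, dtype=float)
    residual = P @ (np.linalg.pinv(P) @ v) - v
    distance = np.linalg.norm(residual)
    # Dividing by the norm of v removes the dependence on object scale.
    return distance / np.linalg.norm(v) if normalize else distance


# Hypothetical prototype and view; v2 is the same view at twice the scale.
rng = np.random.default_rng(1)
P = np.column_stack([rng.normal(size=(8, 3)), np.ones(8)])
v1 = rng.normal(size=8)
v2 = 2.0 * v1

raw_1 = match_distance(P, v1, normalize=False)
raw_2 = match_distance(P, v2, normalize=False)
norm_1 = match_distance(P, v1)
norm_2 = match_distance(P, v2)
```

The unnormalized distance doubles with the view, while the normalized one stays the same. Since the residual is an orthogonal projection of v, the normalized value always lies between 0 and 1.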


D(P, v) is proposed here as an objective function for
establishing the correspondence between the prototype
and the image. In other words, we expect that if the object
belongs to the prototype’s class then D(P, v) obtains
its minimal value when v is ordered in correspondence to
P. Any other permutation will increase the value of D.
Formally, denoting by σ a permutation matrix, we assume
that

$$D(P, v) = \min_{\sigma} D(P, \sigma v) \qquad (13)$$

The measure D(P, v) has a second role. Since it measures
the similarity between the prototype and the image,
it can also be used to determine the object’s class.
An object observed in a view v belongs to the class represented
by a prototype P if

$$D(P, v) < \epsilon \qquad (14)$$

for some constant ε > 0. We refer to (14) as the categorization
criterion.

Figure 1: “Natural” correspondences between two chairs

The categorization stage proceeds as follows. Given
an image I and a prototype P, the correspondence between
P and I is resolved by minimizing the measure
D(P, σv) over all possible permutations σ of v, and if the
obtained minimum D(P, v) is below the threshold ε, then
the class identity of the object is determined.
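
The categorization stage can be sketched with an exhaustive search over permutations. This is an illustration only: the paper does not propose a specific correspondence technique, and brute force is practical only for a handful of points since its cost grows factorially. The sketch assumes image points stored as an n × 2 array and applies the normalized measure to both coordinate columns jointly, which corresponds to stacking the x and y coordinates into one view vector. All names and data are hypothetical.

```python
from itertools import permutations

import numpy as np


def normalized_distance(P, P_pinv, V):
    """Normalized residual of the image points V (n x 2) against prototype P."""
    residual = P @ (P_pinv @ V) - V
    return np.linalg.norm(residual) / np.linalg.norm(V)


def categorize(P, image_points, eps):
    """Minimize the measure over all orderings and apply the threshold eps."""
    V = np.asarray(image_points, dtype=float)
    P_pinv = np.linalg.pinv(P)
    best_d, best_order = np.inf, None
    for order in permutations(range(len(V))):
        d = normalized_distance(P, P_pinv, V[list(order)])
        if d < best_d:
            best_d, best_order = d, order
    return bool(best_d < eps), best_d, best_order


# Hypothetical rigid prototype with six feature points.
prototype_points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0],
                             [0, 0, 1], [1, 1, 0], [1, 0, 1]], dtype=float)
P = np.column_stack([prototype_points, np.ones(6)])

# A view of the prototype whose points arrive in an unknown order.
B = np.array([[1.2, 0.1], [-0.3, 0.9], [0.4, 0.5], [2.0, -1.0]])
shuffled = (P @ B)[[3, 0, 5, 1, 4, 2]]

is_member, best_d, best_order = categorize(P, shuffled, eps=1e-6)
```

For symmetric objects several orderings may reach the same minimum, so the sketch reports the first minimizer found rather than a unique correspondence.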

Note that in our scheme the prototype and the categorization
criterion determine the actual division of objects
into classes; an object belongs to a certain class if
its views are sufficiently similar, according to the categorization
criterion, to views of the prototype. Under
the above definition, an object belongs to a prototype’s
class if the total difference between its feature points and
their corresponding prototype points does not exceed ε.

The measure D(P, v) defined here determines the similarity
between the prototype P and the view v using
only the distances between feature points. In general,
since correspondence is difficult to achieve, such a measure
would not be robust. Including additional information
about the features in the similarity measure may
increase  the  robustness  of  the  scheme.  Also,  measures 
that  consider  only  the  proximity  of  feature  points  are 
limited  in  terms  of  dividing  the  library  into  classes,  since 
they  induce  classes  of  objects  with  highly  similar  shapes. 
Measures  that  consider  additional  information  can  ex¬ 
tend  the  classes  to  include  larger  sets  of  objects. 

The measure D(P, v) can be enriched by considering
the similarity between corresponding points. A simple
example of a measure that considers both the proximity
and similarity between feature points is the following
measure. Each feature point is associated with a label
(such as a corner or an inflection point). Again, the
measure D(P, v) is applied, but this time only correspondences
between points with similar labels are allowed;
namely, corners in the image can only match corners in
the prototype, and, similarly, inflection points can only
match inflection points. Other examples of measures
that combine proximity and similarity include measures
that retain the tangent or the curvature of points. More
sophisticated measures may compare the topologies of
the objects in the two views, or, in other words, verify
that the objects share similar part structures in 2D.

A  useful  technique  in  measuring  the  similarity  be¬ 
tween  the  image  and  the  nearest  prototype  view  is  to 
consider  a  different  set  of  features  than  the  set  used  to 
determine  the  prototype  tremsform.  The  ration2il  behind 
this  technique  is  that  it  is  generally  difficult  to  recover 
exact  feature-to-feature  correspondence,  and  while  such 
correspondences  are  necessary  for  recovering  the  proto¬ 
type  transform,  similarity  measures  can  be  successfully 
applied  even  in  the  absence  of  exact  feature-to-feature 
correspondence.  This  idea  resembles  the  basic  principle 
of  the  alignment  algorithm  [18,  41],  in  which  a  small  set 
of  points  is  used  to  compute  the  object  pose,  while  a 
larger set of points is used to verify this pose.

It  should  be  noted  that  the  general  flow  of  the  scheme 
and,  in  particular,  the  identification  stage  are  indepen¬ 
dent  of  the  specific  choice  of  similarity  measure.  As  has 
been  noted  above,  the  measure  affects  the  division  of 
model  libraries  into  classes  and  the  selection  of  optimal 
prototypes for these classes. An example of selecting
the optimal prototype for a given class under the measure
specified in (12) (for either labeled or unlabeled features)
is described in Section 4.

Finally,  although  the  main  objective  of  the  categoriza¬ 
tion  stage  is  to  determine  the  class  identity  of  the  object, 
the  categorization  scheme  described  above  is  useful  even 
if the object’s category cannot be determined. Section
3.3 below shows that the prototype transform can be
reused  to  align  the  image  with  the  specific  models.  Con¬ 
sequently,  following  the  categorization  stage  the  cost  of 
comparing  the  image  to  each  of  the  specific  models  is 
substantially  reduced  since  the  difficult  part  of  recover¬ 
ing  the  transformation  that  relates  the  models  to  the 
image is applied only to the prototype objects. As a result,
if the class identity of the object is not determined
we still need to consider all the specific models in the
library,  but  the  overall  cost  of  this  process  would  be  low 
because  correspondence  is  computed  once  for  the  whole 
class. 

3.3  Identification 

After  the  observed  object  is  categorized,  the  system 
turns  to  recovering  its  individual  identity.  At  this  stage 
the  image  is  matched  to  all  the  models  in  the  object’s 
class.  For  each  model,  the  system  seeks  to  recover  the 
transformation  that  aligns  the  model  to  the  image,  if 
there  is  such.  In  previous  schemes  this  required  recover¬ 
ing  the  correspondence  between  the  image  and  each  of 
the  models  separately.  In  our  scheme,  however,  this  no 
longer  is  necessary,  since  the  object  transform  is  deter¬ 
mined  directly  from  the  prototype  transform.  We  show 
in  this  section  that  the  prototype  and  the  object  trans¬ 
forms  are  related  by  a  simple  transformation,  which  can 
be computed in advance, and which can in fact be undone
already in the library of stored models. Consequently,
the prototype transform can be reused in the
identification  stage  to  align  the  individual  models  with 
the  image. 

The  initial  stage  of  categorization  recovers  three 
pieces  of  information  that  can  be  used  for  identification. 
The  three  are  (i)  the  object  class,  (ii)  the  correspon¬ 
dence  between  the  prototype  and  the  image,  and  (iii) 
the  prototype  transform.  This  information  is  used  in 
the  identification  stage  as  follows.  First,  since  the  ob¬ 
ject’s  class  is  determined,  only  models  that  belong  to 
this  class  are  considered.  Second,  using  the  correspon¬ 
dence  between  the  prototype  and  the  image  established 
in  the  categorization  stage,  and  using  the  stored  corre¬ 
spondence  between  the  prototype  and  the  object  models, 
the  correspondence  between  the  models  and  the  image 
is  immediately  recovered.  Finally,  as  is  shown  below, 
the  model  transform,  namely,  the  transformation  that 
aligns  the  model  with  the  image,  is  recovered  from  the 
prototype  transform. 

Assume we are given a view $v$ of some object
model $M_i$, namely

$$v = M_i a \qquad (15)$$

for some transform vector $a$. When the identification
process begins, it is still unknown which of the models
$M_1, \ldots, M_l$ of the object’s class accounts for the image
and what the transform vector $a$ is. The first task faced
by the scheme at this stage is to recover the model
transform, $a$. This is done, as is explained below, using the
prototype transform $b = P^{+} v$ defined in (9). Once $a$ is
recovered, it is applied to all the models $M_1, \ldots, M_l$, and
the model for which a near-perfect match is obtained
determines the object’s identity.

Theorem  1  below  establishes  that  the  model  transform 
a  can  be  recovered  directly  from  the  prototype  transform 
b  by  applying  a  linear  transformation  which  is  referred  to 
as the prototype-to-model transform. This transform has
two  interesting  properties.  First,  it  is  view-independent; 
namely,  for  any  given  view  of  the  object,  the  same  trans¬ 
form  maps  the  prototype  transform  that  corresponds  to 
this  view  to  the  correct  model  transform.  The  prototype- 
to-model  transform  therefore  can  be  computed  in  ad¬ 
vance  and  stored  in  the  library  of  models.  Second,  the 
prototype-to-model  transform  can  be  used  to  recover  the 
model  transform  regardless  of  the  quality  of  match  be¬ 
tween  the  prototype  and  the  image.  In  other  words, 
even  if  the  prototype  aligns  poorly  with  the  image,  the 
transformation  that  aligns  the  model  with  the  image  is 
determined  correctly  in  this  process. 

Theorem 1: Given a view $v = M_i a$, let $b = P^{+} v$
be the prototype transform, that is, the transform vector
that best aligns the prototype with the image. The
model transform, $a$, can be recovered from the prototype
transform, $b$, by applying a matrix $A_i$, namely

$$a = A_i b$$

$A_i$ is referred to as the prototype-to-model transform.

Proof: Notice that

$$b = P^{+} v = P^{+} M_i a$$

Assume $P^{+} M_i$ is invertible, and let

$$A_i = (P^{+} M_i)^{-1}$$

We obtain that

$$a = A_i b$$

□

Corollary 2: The prototype-to-model transform is
view-independent.

Proof: The prototype-to-model transform, $A_i$, is
independent of both pose vectors, $a$ and $b$. Changing the
image $v$ will result in a new pair of pose vectors, $a$ and
$b$, but similar to the old pair, the new pair is related
through the same transform $A_i$. The prototype-to-model
transform $A_i$ therefore can be used to recover the object
pose for any view of $M_i$. □
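
Theorem 1 and Corollary 2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the matrix sizes, the random prototype, and the way the model is generated as a small perturbation of the prototype are all assumptions made for illustration. It computes $A_i = (P^{+} M_i)^{-1}$ once and then recovers the model transform for two different views of the same model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 4  # hypothetical sizes: n view coordinates, k transform parameters

# Hypothetical prototype P and model M_i. The model is a small perturbation
# of the prototype so that the objects are similar and P^+ M_i is invertible.
P = rng.standard_normal((n, k))
M_i = P + 0.1 * rng.standard_normal((n, k))

P_pinv = np.linalg.pinv(P)          # P^+
A_i = np.linalg.inv(P_pinv @ M_i)   # prototype-to-model transform (offline)


def recover_model_transform(v):
    """Recover the model transform a from a view v via a = A_i b."""
    b = P_pinv @ v                  # prototype transform, b = P^+ v
    return A_i @ b


# Two different views of the same model reuse the same A_i (Corollary 2).
a1 = rng.standard_normal(k)
a2 = rng.standard_normal(k)
a1_rec = recover_model_transform(M_i @ a1)
a2_rec = recover_model_transform(M_i @ a2)
print(np.allclose(a1_rec, a1), np.allclose(a2_rec, a2))
```

Note that `A_i` is computed before any view is seen, which is what allows it to be stored in the library.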

$A_i$ exists if $P^{+} M_i$ is invertible. This condition is
equivalent to requiring that the two column spaces of
$P$ and $M_i$ are not orthogonal in any direction. The
condition holds, in general, when the two objects are
fairly similar. This is illustrated by the following
example. Consider the case that both column spaces of
$P$ and $M_i$ are one-dimensional; namely, each represents
a line through the origin. The only case in this
one-dimensional example in which $A_i$ does not exist is when
$P$ and $M_i$ are orthogonal. But these lines are farthest
apart when they are orthogonal. Consequently, if the
objects are relatively similar, $A_i$ would exist.

Since it depends only on the prototype $P$ and the
model $M_i$, the prototype-to-model transform $A_i$ can be
pre-computed and stored in the library of models. Every
model $M_i \in C$ is associated with its own transform $A_i$
that relates, for every possible view of $M_i$, the
prototype transform to the model transform. To
compare the image to the model $M_i$, the model transform
should first be recovered. This is achieved by applying
$A_i$ to the prototype transform computed in the
categorization stage.

Also, the prototype-to-model transform, $A_i$, can be
used to align the model $M_i$ with the prototype $P$ in 3D.
Denote the aligned model by $M_i'$. $M_i'$ models the same
object as $M_i$ does, since their column vectors span the
same space. In addition, the aligned model $M_i'$ has the
property that it is brought by the prototype transform, $b$,
to a perfect alignment with the image. Consequently, if
the models are aligned in the library with the prototype,
the prototype transform computed in the categorization
stage can be reused for identification with no further
manipulations. This is established in Theorem 3 below.

Theorem 3: Let $M_i' = M_i A_i$ be the model $M_i$ aligned
with the prototype $P$. For any view $v = M_i a$, the
prototype transform for this view, $b = P^{+} v$, is identical to the
model transform for this view; that is, $v = M_i' b$.

Proof: Since

$$M_i' = M_i A_i$$

we obtain that

$$M_i' b = M_i A_i b = M_i a = v$$

□

Using Theorem 3, the identification scheme is
simplified as follows. The models $M_1, \ldots, M_l$ are aligned in
the library with the prototype $P$ by applying the
corresponding prototype-to-model transforms, $A_1, \ldots, A_l$. At
recognition time, the prototype transform, $b = P^{+} v$, is
applied to the aligned models $M_1', \ldots, M_l'$. According to
Theorems 1 and 3, by transforming the models by $b$ the
correct model, $M_i'$, would perfectly align with the image.
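
The simplified identification scheme can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation used in the paper: the class is a hypothetical set of random models that are small perturbations of a random prototype, views are noiseless, and the residual norm stands in for whatever 2D comparison is actually used.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, num_models = 12, 4, 3  # hypothetical sizes

P = rng.standard_normal((n, k))
P_pinv = np.linalg.pinv(P)

# Hypothetical class: each model is a perturbed copy of the prototype.
models = [P + 0.1 * rng.standard_normal((n, k)) for _ in range(num_models)]

# Library (offline) step: align each model with the prototype,
# M_i' = M_i A_i with A_i = (P^+ M_i)^{-1}.
aligned = [M @ np.linalg.inv(P_pinv @ M) for M in models]


def identify(v):
    """Apply the prototype transform to every aligned model and rank by residual."""
    b = P_pinv @ v  # computed once for the whole class
    residuals = [np.linalg.norm(M_aligned @ b - v) for M_aligned in aligned]
    return int(np.argmin(residuals)), residuals


true_index = 2
v = models[true_index] @ rng.standard_normal(k)  # noiseless view of one model
best, residuals = identify(v)
print(best, residuals[best])
```

The only per-view work is one multiplication by `P_pinv` followed by one matrix-vector product per model; no correspondence or pose is recomputed for the individual models.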

In  the  scheme  above  we  assumed  that  full  feature-to- 
feature  correspondence  is  established  between  the  proto¬ 
type  and  the  image.  This  assumption  is  not  mandatory. 
Methods  for  estimating  the  prototype  transform  using 
partial  correspondence  or  by  considering  other  types  of 
features  (such  as  line  segments)  can  also  be  used.  Note 
that  in  case  the  prototype  transform  can  only  be  approx¬ 
imated,  the  quality  of  this  approximation  as  well  as  the 
condition  number  of  the  prototype-to-model  transform 
$A_i$ determine the accuracy of the model transform
obtained. The condition number of $A_i$ affects the match
even if Theorem 3 is applied, namely, even if the
models are aligned with the prototype in advance.
Consequently, the condition number of the prototype-to-model
transform $A_i$ should be taken into account when the
library is divided into classes.
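
The role of the condition number can be illustrated with the standard perturbation bound for a linear map: if $b$ is estimated with error $\delta b$, then the relative error in $a = A_i b$ is at most $\mathrm{cond}(A_i)$ times the relative error in $b$. The sketch below checks this bound on hypothetical random data; the error magnitude and the matrices are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 4  # hypothetical sizes

P = rng.standard_normal((n, k))
P_pinv = np.linalg.pinv(P)
M_i = P + 0.1 * rng.standard_normal((n, k))  # hypothetical model similar to P

A_i = np.linalg.inv(P_pinv @ M_i)
cond_A = np.linalg.cond(A_i)                 # 2-norm condition number

a = rng.standard_normal(k)
b = P_pinv @ (M_i @ a)                       # exact prototype transform
delta_b = 1e-3 * rng.standard_normal(k)      # assumed estimation error in b
a_est = A_i @ (b + delta_b)                  # model transform from the noisy b

rel_err_b = np.linalg.norm(delta_b) / np.linalg.norm(b)
rel_err_a = np.linalg.norm(a_est - a) / np.linalg.norm(a)
print(cond_A, rel_err_b, rel_err_a)
```

A class whose models yield large values of `cond_A` would amplify small errors in the prototype transform, which is one way to read the recommendation above.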

Finally, the scheme can be extended to handle classes
of objects with different degrees of freedom. Consider,
for instance, the case of similar chairs, some of which are
folding. Obviously, the folding chairs have more degrees
of freedom than the regular, rigid chairs, and therefore
they would be represented in the library by wider
matrices than the rigid chairs are. As is explained below,
the chairs can be handled in a common class, and the
prototype for the class would itself be a folding chair.
More generally, let $M_1, \ldots, M_l$ be a class of models of
different widths, and denote by $k_1, \ldots, k_l$ the widths of
$M_1, \ldots, M_l$ respectively. Let $P$ be the prototype for this
class, and denote by $k_p$ the width of $P$. We set $k_p$ to be

$$k_p = \max\{k_1, \ldots, k_l\} \qquad (16)$$

In other words, we require the prototype to have the
same degrees of freedom as the most flexible object in
the class. We can set $k_p$ according to our goal since, as
is shown in Section 4, the prototype $P$ is obtained in our
scheme by manipulating the objects in the class. The
prototype-to-model transform $A_i$ is defined in this case
by

$$A_i = (P^{+} M_i)^{+} \qquad (17)$$

where $A_i$ is $k_i \times k_p$, so that it maps a prototype transform
of length $k_p$ to a model transform of length $k_i$. It is
straightforward to extend Theorem 1 to also include this
case. Consequently, for any view of $M_i$, the model
transform $a$ can be recovered from its corresponding
prototype transform $b$ by applying the prototype-to-model
transform $A_i$ to $b$. Note that since $k_p \ge k_i$ the
prototype can appear in poses that do not match any possible
model pose (and therefore in noiseless conditions they
are impossible to obtain). In case the object is observed
from such a view, $A_i$ would map this unmatched
prototype transform to the model transform that corresponds
to the nearest matched prototype transform. By setting
$k_p$ to be as large as the maximum of $k_1, \ldots, k_l$ we avoid
cases where there exist views of the object that cannot
be accounted for by the prototype. Model transforms
that correspond to such views cannot be recovered from
prototype transforms.
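
The extension to models of different widths can be sketched with a pseudo-inverse in place of the inverse, as in (17). The example below is an assumption-laden illustration: the widths, the random prototype, and the model built from a perturbed subset of the prototype's columns are all hypothetical. It shows that a transform vector of length $k_p$ is mapped to one of length $k_i$, and that noiseless views are still recovered exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k_p, k_i = 12, 6, 4  # hypothetical: prototype wider than this model

P = rng.standard_normal((n, k_p))
# Hypothetical narrower model resembling the prototype's first k_i columns.
M_i = P[:, :k_i] + 0.1 * rng.standard_normal((n, k_i))

P_pinv = np.linalg.pinv(P)
A_i = np.linalg.pinv(P_pinv @ M_i)  # equation (17), pseudo-inverse of a k_p x k_i matrix

a = rng.standard_normal(k_i)
b = P_pinv @ (M_i @ a)              # prototype transform, length k_p
a_rec = A_i @ b                     # model transform, length k_i
print(A_i.shape, np.allclose(a_rec, a))
```

Exact recovery here relies on $P^{+} M_i$ having full column rank, which holds for this randomly generated example but is an assumption in general.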

3.4  Summary 

We  presented  in  this  section  a  scheme  for  recognizing  3D 
objects  from  single  2D  views  that  proceeds  in  two  stages, 
categorization  and  identification.  In  the  categorization 
stage  the  image  is  compared  against  the  stored  proto¬ 
types.  For  every  prototype,  the  correspondence  between 
the  image  and  the  prototype  is  recovered,  and  the  near¬ 
est  view  of  the  prototype  is  constructed.  The  similarity 
between  this  view  and  the  image  is  evaluated,  and,  if  the 
two are found similar, the class identity of the object is
determined. In the identification stage the observed
object is compared against the models of its class. Since
the  prototype  and  the  models  were  brought  in  the  library 
into  alignment,  the  same  transformation  that  aligns  the 
prototype  to  the  image  also  aligns  the  object  model  to 
the  image.  The  prototype  transform  therefore  is  applied 
to  the  models,  and  the  obtained  views  are  compared  with 
the  image.  The  view  that  is  found  to  be  identical  up  to 
noise  and  occlusion  to  the  image  determines  the  individ¬ 
ual  identity  of  the  object. 

The presented scheme is based on several key
principles. Recognition is divided into two sub-processes,
categorization and identification. In both processes
models are aligned with the image, and the identity of the
object is determined by a 2D comparison; 3D
reconstruction of the observed object from the image is not
performed. The difficult component of the alignment
approach, namely, the recovery of correspondence and
object pose, is performed only once for each class; the
prototype transform is reused in the identification stage
to align the image with the individual models.

4  Constructing  optimal  prototypes 

In  the  scheme  above  we  assumed  that  the  classes  in  the 
library  of  models  are  represented  by  prototype  objects. 
Since  categorization  is  achieved  by  matching  the  im¬ 
age  to  prototype  objects,  the  question  of  how  to  select 
the  best  prototype  should  be  addressed.  In  this  section 
we  present  an  algorithm  for  constructing  optimal  proto¬ 
types. 

Given  a  class  of  objects,  the  optimal  prototype  for 
this  class  is  the  object  that  resembles  the  objects  of  the 
class  the  most.  Under  our  formulation,  such  an  object 
would  share  as  many  features  as  possible  with  the  ob¬ 
jects  of  its  class,  the  position  of  these  features  on  the 
prototype  would  be  as  close  as  possible  to  their  position 
on  the  objects,  and  the  prototype-to-model  transform 
for  these  objects  would  be  as  stable  as  possible.  Below 
we show that the optimal prototype can effectively be
computed  using  principal  component  analysis;  that  is, 
by  computing  the  dominant  eigenvectors  for  some  ma¬ 
trix  determined  by  the  models  of  the  class. 

Principal  component  analysis  often  is  used  in  clas¬ 
sification  problems  to  construct  classes  and  prototypes 
[11].  In  existing  applications,  an  object  is  represented  by 
a  point  in  some  high  dimensional  space,  where  each  com¬ 
ponent  of  this  point  contains  an  invariant  attribute  of  the 
object. A hyperplane in that space represents a class of
objects.  The  goal  of  the  principal  component  analysis 
is,  given  a  set  of  points  (objects),  to  recover  the  class 
that  these  points  induce.  Our  case  is  somewhat  differ¬ 
ent.  In  our  case  an  object  is  represented  by  a  continuous 
linear  space  rather  than  by  a  point.  Whereas  the  use 
of  hyperplanes  in  other  schemes  often  is  arbitrary  and 
made  primarily  for  convenience,  their  use  in  our  scheme 
is  appropriate  following  the  linear  combination  scheme 
[42]  (see  Section  3.1). 

The  differences  outlined  above  also  imply  differences 
in the proof that principal component analysis applies
to  our  case.  We  show  below  that  the  optimal  prototype 
can  be  computed  by  principal  component  analysis.  The 
traditional  proof  needs  to  be  extended  since  in  our  case 
objects  are  represented  by  continuous  spaces  rather  than 
by  discrete  points. 

The  prototype  constructed  in  this  process  is  a  3D  ob¬ 
ject  obtained  by  manipulating  the  objects  in  its  class. 
To  allow  the  construction,  it  seems  as  if  the  objects  in 
the  class  should  first  be  brought  into  alignment.  In  par¬ 
ticular,  if  the  objects  are  represented  by  viewer-centered 
models  (that  is,  by  sets  of  their  views,  see  Section  3.1  for 
details),  the  different  objects  would  then  have  to  be  rep¬ 
resented  by  images  taken  from  similar  viewpoints.  Nev¬ 
ertheless,  the  process  presented  below  does  not  require 
an initial alignment of the objects. The same prototype
is obtained in this process even when the objects are not
aligned.

We now turn to constructing the optimal prototype.
First, we define an objective function. Given a
prototype $P$ and an object model $M_i$, we define the similarity
between $P$ and $M_i$ as follows. Let $v_i$ be a view of $M_i$;
we measure the similarity between the prototype $P$ and
the view $v_i$ using (12). Then, we sum the measure over
all possible views of $M_i$. Assuming without loss of
generality that $\|v_i\| = 1$, (14) can be rewritten as

$$D(P, v_i) = \|(PP^{+} - I)\, v_i\| \qquad (18)$$

Without loss of generality, we can assume that the
constructed prototype, $P$, is composed of orthonormal
columns. Note that an overdetermined matrix $P$ with
orthonormal columns satisfies $P^{+} = P^{T}$. We can
therefore rewrite (18) as

$$D(P, v_i) = \|(PP^{T} - I)\, v_i\| \qquad (19)$$

The distance between $P$ and the model $M_i$ is now given
by summing $D(P, v_i)$ over all unit-length (to eliminate
scaling effects) views of $M_i$, namely

$$D(P, M_i) = \int_{\|v_i\| = 1} \|(PP^{T} - I)\, v_i\| \qquad (20)$$

To obtain the objective function, we sum these distances
over all models

$$E(P) = \sum_{i=1}^{l} \int_{\|v_i\| = 1} \|(PP^{T} - I)\, v_i\| \qquad (21)$$

The object $P$ that minimizes this function is defined to
be the optimal prototype.

Note that (21) is not the only possible objective
function for this purpose. An alternative “worst case”
approach is to measure the distance between the prototype
and the farthest model in the class (rather than summing
this distance over all models). Besides being difficult
to compute, this measure also is sensitive to “outlier”
models.

The  prototype  that  minimizes  (21)  can  be  constructed 
in  a  process  that  includes  the  following  steps. 

1. To simplify the process we assume the column
vectors of each of the model matrices $M_i$ ($1 \le i \le l$)
are orthonormal. (In case they are not, we first
apply a Gram-Schmidt process to them. Such a process
obviously does not alter the space of views implied
by the models.)

2. Build the $n \times n$ symmetric matrix

$$F = \sum_{i=1}^{l} M_i M_i^{T}$$

3. Find the $k$ dominant eigenvectors of $F$. The
optimal matrix $P$ is constructed from these
eigenvectors.
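
The three steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: a QR factorization is used in place of an explicit Gram-Schmidt process (both yield an orthonormal basis for the same column space), and the models are hypothetical random matrices. The helper `squared_residual` is introduced only for the sanity check; it evaluates the sum of squared distances from the orthonormalized model columns to the span of a candidate prototype, which is the discrete least-squares quantity that the dominant eigenvectors minimize, not the integral objective itself.

```python
import numpy as np


def build_prototype(models, k):
    """Return (P, F): P holds the k dominant eigenvectors of F = sum_i M_i M_i^T."""
    # Step 1: orthonormalize each model's columns (QR instead of Gram-Schmidt).
    ortho = [np.linalg.qr(M)[0] for M in models]
    # Step 2: build the n x n symmetric matrix F.
    F = sum(Q @ Q.T for Q in ortho)
    # Step 3: eigh returns eigenvalues in ascending order, so the dominant
    # eigenvectors are the last k columns; reverse them to descending order.
    _, eigvecs = np.linalg.eigh(F)
    return eigvecs[:, -k:][:, ::-1], F


def squared_residual(P, models):
    """Sum of squared distances from orthonormalized model columns to span(P)."""
    total = 0.0
    for M in models:
        Q = np.linalg.qr(M)[0]
        total += np.linalg.norm(P @ (P.T @ Q) - Q) ** 2
    return total


rng = np.random.default_rng(4)
n, k = 12, 4  # hypothetical sizes
base = rng.standard_normal((n, k))
models = [base + 0.1 * rng.standard_normal((n, k)) for _ in range(5)]

P, F = build_prototype(models, k)
Q0 = np.linalg.qr(models[0])[0]  # a class member used as an alternative prototype
print(squared_residual(P, models), squared_residual(Q0, models))
```

In this sketch the constructed prototype is compared against simply choosing one class member as the prototype; the eigenvector solution cannot do worse on the squared-residual measure.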

Note that, in general, we are trying to construct a
prototype object that would belong to the given class. This
condition determines the choice of width $k$ for the
prototype. If all the models share the same width then the
prototype would assume this width. In the rigid case,
for example, $k = 4$ (see Section 3.1). As mentioned in
Section 3.3 above, in case the objects have different
degrees of freedom, $k$ is set to be the maximum of
$k_1, \ldots, k_l$, where $k_1, \ldots, k_l$ are the widths of
$M_1, \ldots, M_l$ respectively. In case more than $k$ large
eigenvalues are obtained, one may ignore these guidelines
and construct a prototype that has higher degrees of
freedom than the objects in the class (see for example [31]).

Theorem 4 below establishes that the algorithm above
produces the optimal prototype. We consider here the
case that all the objects share similar degrees of freedom.
The same procedure can be applied with slight
modifications to include the case of objects with different
degrees of freedom.

Theorem 4: Let $M_1, M_2, \ldots, M_l$ be a set of models
belonging to some class $C$. Assume every model $M_i$ is
represented by an $n \times k$ matrix with orthonormal column
vectors. The prototype $P$ that minimizes the term

$$E(P) = \sum_{i=1}^{l} \int_{\|v_i\| = 1} \|(PP^{T} - I)\, v_i\|$$

where the integration is done over all the unit-length
views $v_i$ of each model $M_i$, is composed of the $k$
eigenvectors of the matrix

$$F = \sum_{i=1}^{l} M_i M_i^{T}$$

that correspond to its $k$ largest eigenvalues.

Proof: Let $P$ be composed of the $k$ dominant
eigenvectors of $F$. According to regression principles $P$
minimizes the term

$$\sum_{i=1}^{l} \sum_{j=1}^{k} \|(PP^{T} - I)\, m_{ij}\|$$

where $m_{ij}$ is the $j$'th column vector of $M_i$. In other
words, consider $m_{ij}$ as a point in $\mathbb{R}^n$. The space spanned
by the column vectors of $P$ is the nearest $k$-dimensional
hyperplane to these points, $m_{ij}$. The rest of this proof
extends the claim from the discrete sum over the column
vectors of $M_i$ to the continuous integral over all views
spanned by these vectors. According to our assumptions,
each matrix $M_i$ contains an orthonormal set of column
vectors. Replacing these vectors by another orthonormal
basis for $M_i$ will not change the matrix $P$; that is, $P$ is
independent of the choice of orthonormal basis for the
models. This is illustrated by the following derivation.
To obtain a new orthonormal basis for the column space
of $M_i$ we can apply a $k \times k$ rotation matrix $R$ to $M_i$
(namely, $M_i R$). $P$ is the best vector space for the new
set as well, since

$$M_i R (M_i R)^{T} = M_i R R^{T} M_i^{T} = M_i I M_i^{T} = M_i M_i^{T}$$

$F$ therefore is constant for any choice of orthonormal
vectors for $M_1, \ldots, M_l$, and so its dominant eigenvectors
represent the best vector space for any orthonormal
representation of the objects. Consequently, $P$ minimizes
the objective function regardless of the choice of basis for
the models, and therefore it also minimizes the required
term

$$E(P) = \sum_{i=1}^{l} \int_{\|v_i\| = 1} \|(PP^{T} - I)\, v_i\|$$

□

To  summarize,  we  showed  that  given  a  class  of  object 
models,  the  optimal  prototype  for  this  class  is  given  by 
the  dominant  eigenvectors  of  the  matrix  F,  which  is  con¬ 
structed  from  the  object  models.  Note  that  in  proving 
Theorem  4  we  showed  that  the  prototype  is  independent 
of  choice  of  basis  for  the  models.  This  implies  that,  in 
order to construct the prototype, the object models
$M_1, \ldots, M_l$ do not need to first be brought into alignment.
The process above is guaranteed to output the same
prototype object even if the models are not aligned.
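
The basis-independence argument can be checked numerically. The short sketch below uses hypothetical random data: it takes a model with orthonormal columns, re-expresses it in a different orthonormal basis of the same column space, and confirms that its contribution $M_i M_i^{T}$ to $F$ is unchanged. The $k \times k$ matrix is obtained from a QR factorization, so it is orthogonal but may include a reflection; the identity $R R^{T} = I$ used in the derivation holds either way.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 12, 4  # hypothetical sizes

M = np.linalg.qr(rng.standard_normal((n, k)))[0]  # model with orthonormal columns
R = np.linalg.qr(rng.standard_normal((k, k)))[0]  # k x k orthogonal matrix
M_rot = M @ R                                     # same column space, new basis

contribution_unchanged = np.allclose(M @ M.T, M_rot @ M_rot.T)
print(contribution_unchanged)
```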

5  Relevance  to  human  vision 

The  recognition  by  prototypes  scheme  uses  the  general 
shape  of  objects  as  the  cue  for  recognizing  them.  As  was 
already  mentioned,  classes  in  our  scheme  contain  objects 
with  fairly  similar  shapes.  In  contrast,  the  human  vi¬ 
sual  system  recognizes  objects  using  both  shape  cues  as 
well  as  many  other  cues,  such  as  color,  texture,  motion, 
and  context,  and  objects  are  categorized  in  their  basic 
level of abstraction [33]. Little is currently known
about the underlying processes for recognition used by
the visual system. From what is known, in spite of the
differences pointed out above, the recognition by prototypes
scheme  seems  to  be  consistent  in  several  key  issues  with 
psychological  and  physiological  findings.  In  this  section 
we  briefly  review  these  findings. 

The  scheme  presented  in  this  paper  promotes  the  no¬ 
tion  that  categorization  and  identification  are  performed 
using  similar  tools.  In  both  cases  view  variations  first 
are  compensated  for,  and  then  a  view  of  either  the  hy¬ 
pothesized  prototype  or  object  model  is  compared  with 
the  image.  This  is  in  contrast  to  methods  (such  as  part 
decomposition  and  functional  description)  that  in  gen¬ 
eral  handle  either  categorization  or  identification,  but 
do not extend to deal with both problems. The available
studies in this case are inconclusive. Some evidence
seems to indicate that the two processes are handled
separately by the visual system. Agnosic and prosopagnosic
patients often demonstrate degraded identification
abilities, whereas their performance in categorization remains
intact. Double dissociation between the two processes,
however, has not been found, and so the assumption that
the two processes are handled separately in the brain has
not been established. In fact, both cells that respond
to general faces and cells that respond to specific
faces were found lying side by side within the same
brain area, STS, of the macaque monkey [29]. The
vulnerability of the identification process to brain lesions
can be explained by the fact that the process requires a
relatively large memory to encode the detailed shapes of
objects as well as sophisticated image processing mechanisms
to recover a detailed description of the observed object
from the image (see, e.g., [19]).

Another idea proposed here is that categorization
involves two stages: a stage of compensating for view
variations followed by a stage of 2D comparison to account
for shape differences. A decoupling of view variation
and semantic categorization was suggested by Lissauer
[24]. Warrington and Taylor [44, 45] found that
patients that suffer from lesions in the posterior lobe of
the right hemisphere demonstrate difficulties in
categorizing objects from unconventional views, whereas their
performance in categorization of objects from
conventional views remains intact. Additional evidence for the
effect of view variations on categorization performance
was found for healthy subjects. Subjects that are asked
to name objects respond slower when the objects
appear in unconventional views [28]. Also, mental rotation
effects, namely, response time that grows linearly with
the tilt of the object, were observed in naming tasks of
natural objects [21].

Finally,  the  process  of  categorization  presented  here 
is achieved by comparing the image to prototype objects,
and these prototype objects can be constructed by
manipulating  the  familiar  objects  of  the  class.  Recent 
studies  indicate  that  response  time  in  naming  tasks  is 
typically  shorter  and  error  rates  are  lower  when  the  ob¬ 
served  object  is  similar  to  the  prototype  [5].  Similarly, 
shorter  reaction  time  is  obtained  when  subjects  are  asked 
to answer questions of the type “does the object X belong
to the class Y?” [34]. Other studies reported that
children  learn  good  examples  of  classes  before  they  learn 
poor  ones  [1,  32]  and  that  subjects  recall  having  seen 
the  prototype  or  average  configuration  of  studied  face 
images  even  if  this  configuration  was  not  studied  [8]. 

To summarize, although the presented scheme generally
does not recognize objects in their basic level of
abstraction,  it  is  consistent  with  psychological  and  phys¬ 
iological  findings  in  several  key  issues  including  a  single 
approach  for  the  two  sub-problems  of  recognition,  cat¬ 
egorization  and  identification,  view  dependency  of  the 
two  sub-processes,  and  the  role  of  prototypes  in  catego¬ 
rization.  The  findings  discussed  here  obviously  are  in¬ 
conclusive,  since  psychological  and  physiological  studies 
including  the  ones  discussed  here  have  more  than  one 
possible  interpretation. 

6  Implementation 

To  test  the  ideas  presented  in  the  paper,  we  have  imple¬ 
mented  the  scheme  and  applied  it  to  several  objects.  In 
our  implementation,  the  library  of  models  included  two 
classes.  The  first  (Figure  2)  contained  two  four-legged 
chairs (denoted by A and B), and the second (Figure 3)
included two car models, a VW and a Saab.

To  demonstrate  categorization,  we  used  chair  A  as  a 
prototype  and  matched  it  to  an  image  of  chair  B.  Corre¬ 
spondences  between  the  prototype  and  the  image  were 
picked  manually,  and,  using  these  correspondences,  the 
prototype  transform  was  recovered  and  applied  to  the 
prototype. The results of matching the transformed prototype
with the image are seen in Figure 4. It can be seen
that  the  transformed  prototype  (middle  figure)  assumed 
the  same  orientation  as  the  observed  object  (left  figure), 
and  that  the  match  between  the  two  is  good  considering 
that  the  objects  have  different  shapes.  Note  that  in  this 
implementation we allowed the objects to undergo general
affine transformations in 3D, including stretch and
shear, and so the match between the prototype and the
image was better than if only rigid transformations were
allowed. Additional examples using chair B and the two
cars as the prototypes are shown in Figures 5-7.

In  Figures  8-9  we  tried  to  match  the  prototypes  to  the 
images with wrong correspondences. The results of these
matches were significantly worse than when the correct
matches were used. This is consistent with the idea
discussed in Section 3.2 that the quality of the match can
be  used  as  the  objective  function  for  resolving  the  correct 
correspondence. 

Figure 10 shows the results of matching a prototype
four-legged chair to a single-legged office chair. It can
be  seen  that  the  upper  portions  of  the  chairs  match  rel¬ 
atively  well,  while  the  legs  of  the  chairs  do  not  find  ap¬ 
propriate  matches. 

Figure  11  shows  the  result  of  matching  a  prototype
chair  to  an  image  of  a  Saab  car.  As  an  anecdotal  example,
we  matched  the  hole  below  the  back  of  the  chair
to  the  windshield  of  the  car  and  the  seat  to  the  hood.
In  general,  whatever  correspondence  is  used,  the  two  objects
would  match  poorly  relative  to  matching  the  prototypes
to  objects  of  their  own  class.

Figures  12-13  demonstrate  the  identification  stage.  In
the  library  we  first  aligned  the  model  for  chair  A  with
the  prototype  chair  (chair  B)  using  the  prototype-to-model
transform.  Then,  an  image  of  chair  A  was  categorized
(Figure  5)  by  matching  it  to  the  prototype  chair,
and  the  prototype  transform  was  computed.  In  the  next
step,  the  prototype  transform  was  applied  to  the  specific
model  of  chair  A.  The  result  of  this  application  is  seen
in  Figure  12.  It  can  be  seen  that  a  near-perfect  alignment
was  achieved  in  this  process.  A  similar  process
was  applied  to  the  VW  car  in  Figure  13  using  the  Saab
car  as  the  prototype.  (The  result  of  the  corresponding
categorization  stage  was  shown  in  Figure  6.)  These  figures
demonstrate  that  although  a  perfect  match  between
the  prototype  and  the  image  could  not  be  obtained,
the  prototype  transform  can  still  be  used  to  align  the
observed  object  with  its  specific  model.
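
The reuse of the prototype transform in the identification stage can be sketched in a few lines of Python. The sketch assumes that every model in the class has already been aligned with the prototype in the library, that model points are stored in the same order as the matched image features, and that the prototype transform is available as a 2x3 matrix `A` and a 2-vector `t` from the categorization stage. These representations, the names `rms_distance` and `identify`, and the coordinates are hypothetical choices for illustration, not the implementation used for Figures 12-13.

```python
import numpy as np

def rms_distance(points_a, points_b):
    """RMS distance between two equally ordered 2D point sets."""
    diff = np.asarray(points_a, dtype=float) - np.asarray(points_b, dtype=float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def identify(A, t, class_models, image_pts):
    """Apply the already computed prototype transform (A, t) to every model
    of the class and return the best-matching model name with all errors.

    No pose or correspondence is recovered here; each model costs one
    transform application and one point-set comparison.
    """
    errors = {}
    for name, model_pts in class_models.items():
        projected = np.asarray(model_pts, dtype=float) @ A.T + t
        errors[name] = rms_distance(projected, image_pts)
    best = min(errors, key=errors.get)
    return best, errors

# Invented models, assumed to be pre-aligned with the prototype.
chair_a = np.array([
    [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.0, 2.0],
])
chair_b = chair_a * np.array([1.0, 1.0, 1.5])   # same layout, taller
class_models = {"chair A": chair_a, "chair B": chair_b}

# Prototype transform, assumed to come from the categorization stage.
A = np.array([[0.8, -0.3, 0.1], [0.2, 0.9, -0.4]])
t = np.array([5.0, -2.0])

# Synthetic image of chair A seen under that transform.
image_pts = chair_a @ A.T + t

name, errors = identify(A, t, class_models, image_pts)
print("identified as:", name)
print("errors:", errors)
```

In a real setting the transform recovered from the prototype would only approximate the pose of the specific object, so the error for the correct model would be small rather than exactly zero.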

7  Summary 

We  introduced  in  this  paper  a  recognition  scheme  that
proceeds  in  two  stages:  categorization  and  identification.
Categorization  is  achieved  by  aligning  the  image  to  prototype
objects.  For  every  prototype,  the  nearest  prototype
view  is  recovered,  and  the  similarity  between  this
view  and  the  image  is  evaluated.  The  prototype  that
most  resembles  the  observed  object  determines  its  class
identity.  Likewise,  identification  is  achieved  by  aligning
the  observed  object  to  the  individual  models  of  its
class.  At  this  stage  the  prototype  transform  computed
in  the  categorization  stage  is  reused  to  align  the  models
with  the  image.  The  model  that  matches  the  observed
object  determines  its  specific  identity.  In  addition,  we
presented  an  algorithm  for  constructing  the  optimal  prototypes
and  discussed  the  relevance  of  the  scheme  to  human
recognition.

An  important  point  conveyed  by  our  scheme  is  that
categorization  can  be  used  to  facilitate  the  identification
of  objects.  We  showed  that  by  first  categorizing  the  object,
the  difficult  stages  of  the  alignment  process,  namely,
the  recovery  of  the  object  pose  and  the  correspondence
between  the  image  and  the  model,  can  be  performed  only
once  per  class.  Consequently,  identification  is  reduced  in
this  scheme  to  a  series  of  simple  template  comparisons.

Figure  3:  Pictures  of  two  cars  used  as  models.  Left:  a  VW  model.  Right:  a  Saab  model.  Models  for  the  two  cars  were
borrowed  from  [42].


Figure  4:  Matching  a  prototype  chair  (chair  A)  to  an  image  of  chair  B.  This  figure,  as  well  as  the  rest  of  the  figures,  contains
three  pictures.  Left:  the  image  to  be  recognized.  Middle:  the  appearance  of  the  prototype  following  the  application  of  the
prototype  transform.  Right:  an  overlay  of  the  left  and  the  middle  pictures.


Figure  11:  Matching  a  prototype  chair  (chair  A)  to  an  image  of  a  Saab  car.


Figure  12:  Matching  a  model  of  chair  A  to  an  image  of  the  same  chair  using  the  prototype  transform  computed  in  the
categorization  stage.


Figure  13:  Matching  a  model  of  a  VW  car  to  an  image  of  the  same  car  using  the  prototype  transform  computed  in  the
categorization  stage.

The  scheme  presented  in  this  paper  differs  from  existing
categorization  schemes  in  two  important  respects.
The  existing  schemes  (e.g.,  [4])  first  attempt  to  recover
the  part  structure  (geons)  of  the  object  from  the  image
alone.  This  structure  is  assumed  to  be  almost  invariant
both  to  rotation  of  the  object  and  across  objects  of
the  same  class.  In  contrast,  our  scheme  does  not  attempt
to  recover  any  3D  information  from  the  image
alone.  Moreover,  it  separates  the  two  effects  that  determine
the  object’s  appearance:  view  variation  effects  and
deformations  due  to  class  variability.  View  variations  are
compensated  for  by  recovering  the  view  of  the  prototype
that  most  resembles  the  image,  and  the  amount  of  deformation
that  separates  the  prototype  from  the  specific
object  is  evaluated  by  assessing  the  difference  (in  2D)
between  the  nearest  prototype  view  and  the  image.

Open  problems  for  future  research  include  solving  the 
correspondence  between  prototypes  and  images,  defining 
effective  measures  to  evaluate  the  quality  of  matches, 
and  extending  the  system  to  incorporate  additional  cues, 
such  as  color  and  texture.